Evidence for wastewaters as environments where mobile antibiotic resistance genes emerge

The emergence and spread of mobile antibiotic resistance genes (ARGs) in pathogens have become a serious threat to global health. Still little is known about where ARGs gain mobility in the first place. Here, we aimed to collect evidence indicating where such initial mobilization events of clinically relevant ARGs may have occurred. We found that the majority of previously identified origin species did not carry the mobilizing elements that likely enabled intracellular mobility of the ARGs, suggesting a necessary interplay between different bacteria. Analyses of a broad range of metagenomes revealed that wastewaters and wastewater-impacted environments had by far the highest abundance of both origin species and corresponding mobilizing elements. Most origin species were only occasionally detected in other environments. Co-occurrence of origin species and corresponding mobilizing elements was rare in human microbiota. Our results identify wastewaters and wastewater-impacted environments as plausible arenas for the initial mobilization of resistance genes.

The study entitled "Evidence for wastewaters as environments where mobile antibiotic resistance genes emerge" aims to demonstrate that wastewater is a privileged environment where antibiotic resistance genes can be acquired by novel hosts. The manuscript refers to an interesting data set that describes the co-occurrence of abundant bacterial groups and genetic elements involved in recombination events in wastewater environments. The study is a continuation of a previous publication by the same group (Commun Biol 4, 8 (2021)). The narrative of this submission elaborates on the acquisition and mobilization of novel antibiotic resistance genes based on a probabilistic rationale. Important ecology and evolution concepts are largely neglected. Despite the merits of the approach used, the results presented are not conclusive and may lead to incorrect interpretations. Some aspects should be (re)considered:
1. It is assumed that members of a given species will have the same fitness in all environments where they can thrive (e.g. wastewater or human stool). This is incorrect. Fitness may be strongly influenced by the microbial community composition, density (cells per volume or area), and exogenous stress factors, among many other factors. Another assumption that may not be straightforward is that different strains of a given genus (or species) will have identical fitness under a specific set of conditions.
2. It is assumed that if some taxa and some genetic elements prevail in a given environment (co-occurrence), they are likely associated. Although this is a logical principle, it is not necessarily true, and therefore should not be tacitly accepted unless evidence of such association is demonstrated. However, co-occurrence is the strongest argument provided.
3. It is assumed that the process of acquisition of novel antibiotic resistance genes involves the most abundant bacterial groups in a given environment. If there is any evidence that demonstrates this, it should be clearly provided in the manuscript. Stochastic events are probably part of the resistance acquisition process. Also, the spatial distribution of donor and recipient bacteria may be important in this process and it cannot be predicted based on abundance. Hence, it may be erroneous to generalize that acquisition of novel resistance genes will involve the most abundant species or a specific type of environment.

Other comments:
It is difficult to understand how the authors identified the "origin species": which criteria or information sources were used? This is not clear for the reader.
L11 - The question may be not where, but when and how antibiotic resistance is acquired for the first time.
L39 - "origin species" is misleading; indeed it seems the authors are referring to founder effects and founder species.
The hypothesis of the study is not clearly stated, nor is it clearly aligned with the experimental design.
L108 - Which species hold these MISE? Which is the average MISE number per bacterial genome? How different is that average number in different environments? This information might be useful for a better understanding of the data and the results.
L115 and following - Since the methods are the last section of the paper, there should be a better description of procedures in the results section, to allow the reader to follow a rationale. For instance, it would be beneficial to have a brief explanation of the database and the optimization procedure, the criterion to select "origin species", etc.
L117 - It seems that this validation may have some bias, since with genomes the probability of unspecific identification is much lower than in a metagenome data set. Was this normalized taking such bias into account?
Most of the MISE presumably associated with antibiotic resistance genes and IS (figure 3) are indeed observed in raw influent. Although the number of metagenomes analysed is small, it seems that the relative abundance of these genetic elements is reduced during wastewater treatment. According to these data, and to support the hypothesis raised in the title of the manuscript, which would be the preferential niche for antibiotic resistance acquisition? The plumbing system? In general, there is an important bias for Proteobacteria in wastewater, so the conclusions are strongly influenced by this fact. It is assumed that because wastewater has a higher abundance of Proteobacteria than other environments, they are probably acquiring novel genes in that environment. Although this is a logical argument, it seems to lack strong evidence. When wastewater is compared with human stool, it might be more accurate to consider the ratio of "origin species" to members of the same phylum in that environment. Otherwise, all data are strongly biased by abundance.

Summary
The authors use a set of species to look for them in public metagenomic data from different environments. The metagenome analysis shows that wastewaters and human stools have a high abundance of the set of species. They analyze some ISs in the set of species and find that IS content is the highest in hospital effluent, poultry faeces and WWTP. Their results suggest that residual water could be a site for the mobilization of antibiotic-resistance genes. However, there are important issues that need to be addressed before this study can be considered for publication.

Major comments
1. There is an important piece of data or analysis missing. The authors did not present the data where one can see that the ISs are carrying antibiotic resistance genes (ARGs). The authors conducted analyses showing the presence or absence of the ISs in the origin species but it's not shown if these ISs have ARGs. Equally important, the authors basically just analyze ISs and not a good diversity of Mobile Genetic Elements (i.e. plasmids, phages, pathogenicity islands, etc.). Considering this issue, the title and abstract are misleading. Given what is stated in the introduction, specifically lines 40-47, I assume they have already conducted the analysis linking ISs to ARGs in the "origin" species in a previous study, if so please say that explicitly.
2. Why did the authors choose the species listed in Table 1 as their origin species? Please give the reason behind it. There are some ESKAPE pathogens but also some other bacterial species whose clinical relevance is not obvious. Again, I think the authors used the set of species considered in a previous study by them.
Introduction
Line 50/51: reference 16: to my understanding this reference makes no claim supporting this statement. This article highlights that the development of methicillin resistance predates the clinical use of beta-lactams, not that the origin species evolved so far that it cannot be identified anymore. Maybe the authors could rephrase and explain what they meant by this statement, or whether their statement is an interpretation of the results / study they are referring to.
Lines 69 to 79 are based on 2 review papers and the authors' hypothesis only. This section also does not consider the natural lifestyle of bacteria (biofilms), or the role of free and external DNA and naturally competent cells in complex environments (nor in the environment studied).
The first paragraph of the results section is very difficult to understand and seems to fit better into the methods section. The definition of origin species is not clear and requires the reader to fully study the previously published work by the same authors (reference 15). I would suggest re-phrasing this paragraph and explaining how the origin species are defined. The optimized database should be explained (which criteria) and referenced. The authors do not describe clearly how small fragments of 100 to 150 bp from metagenomic sequence data are enough to identify the origin species (what type of fragments? Filtered 16S rRNA gene fragments? Or several reads mapping to different parts of the genome of the origin species? How is that specific enough to allow the identification of a species based on small reads? It could be a fragment that originates from another species (the core genome for example) or a closely related genus). It would be great if the authors could explain how they identified / classified short metagenomic sequencing reads at species level. It seems like a truly innovative and valuable approach; however, the way it is written makes it very difficult to grasp (also when reading their other paper in which they curated the database of origin species). I would suggest an extended method section reintroducing the method, the theory, and the definition of origin species. Technical and statistical details on validating their data using TPR and FPR should be in supplementary data. Figure 1 could use some clarifications, for example how many origin species are represented (all? Is that the full database?).
The fact that the two Indian sites contained the highest amount of origin species is interesting - would that not indicate that the pollution / selection level of antibiotics, rather than the environment per se, is determining the abundance of those species? Hence, the fact that wastewater contains chemical and pharmaceutical pollutants might explain why these origin species are, for example, higher on average in WW than in the human gut. This would also correlate with the findings in lines 179-185.
Figure 2 - the authors specified that they used all genomes available for the origin species originally identified to carry non-mobile ARGs - meaning that for 10 species there were fewer than 5 genomes available for this analysis? Is it safe to assume that because the IS elements were not present on those genomes, they were not present before or in other isolates of the respective species (not sequenced or not included in the analysis)?
Line 262 to 267 and Figure 4: It is not clear which fraction of the presented data are water or soil samples. Are they mixed? Or does one part of Figure 4 a and b represent water and the other part soil samples? Please clarify. It is an interesting finding that MISE and origin species pairs were associated with crAssphage in all environments but the human gut. Any explanation?

Discussion
The discussion states the most important findings and provides a hypothesis to explain these findings. The discussion should be shortened and focus more on the recent literature of other groups. In general, the discussion lacks references for certain claims (lines 309, 310; 339; 349, 350; 374, 375; 405, 406; 416; 416-419). Line 362: By hospital effluent, do the authors mean treated hospital wastewater? And does WWTP influent refer to untreated, urban wastewater? Lines 383-399 discuss the authors' implemented approach. This should rather go to the beginning or the end of the discussion, and, as mentioned before, also partly into the results and methods section so the reader can understand the approach (without having to fully study their previous paper). Lines 400 to 419 are redundant with what has been concluded in the text before. Concerning some aspects that are missing in the discussion: For example, a recently published paper by Che et al., https://doi.org/10.1073/pnas.2008731118 suggested that a general evolutionary mechanism for the horizontal transfer of AMR genes is mediated by the interaction between conjugative plasmids and ISs. Are the most relevant ISs identified in that study similar to / the same as the MISE studied here?
The authors could consider including the conjugative plasmids and ISs identified by Che et al. in their analysis pipeline. This could be a very important addition and a means to compare different analysis approaches to see whether results are comparable, and to extend the findings of Che et al. to the large environmental metagenomic datasets used here. The wastewater environment is, on average and in a more permanent way, an environment under high selective pressure (containing chemicals, pharmaceuticals, heavy metals, micropollutants and surfactants), whereas the human gut is only sporadically exposed to e.g. high doses of antibiotics or other chemicals / pharmaceuticals. This should be considered and discussed in the context of current literature. How important is the presence of the origin species in a certain environment compared to the presence of ISs or MISEs (the latter seems much more relevant)?

R1:
The study entitled "Evidence for wastewaters as environments where mobile antibiotic resistance genes emerge" aims to demonstrate that wastewater is a privileged environment where antibiotic resistance genes can be acquired by novel hosts. The manuscript refers to an interesting data set that describes the co-occurrence of abundant bacterial groups and genetic elements involved in recombination events in wastewater environments. The study is a continuation of a previous publication by the same group (Commun Biol 4, 8 (2021)). The narrative of this submission elaborates on the acquisition and mobilization of novel antibiotic resistance genes based on a probabilistic rationale. Important ecology and evolution concepts are largely neglected. Despite the merits of the approach used, the results presented are not conclusive and may lead to incorrect interpretations. Some aspects should be (re)considered:
1. It is assumed that members of a given species will have the same fitness in all environments where they can thrive (e.g. wastewater or human stool). This is incorrect. Fitness may be strongly influenced by the microbial community composition, density (cells per volume or area), and exogenous stress factors, among many other factors. Another assumption that may not be straightforward is that different strains of a given genus (or species) will have identical fitness under a specific set of conditions.
Authors' reply: It is indeed true that fitness for different members of different species varies across different environments but, contrary to what the reviewer says, we do not make any assumptions regarding fitness in this manuscript. Our assumption is that a higher abundance of donor species and corresponding mobilizing elements (given everything else is similar) will increase the chance that a mobilization event has taken or will take place. And as we observe high abundances of many donor species and mobilizing elements in e.g. sewage, we do not rely on theoretical assumptions about their fitness.
We do, however, discuss why the abundance of certain species is higher in certain environments (see e.g. paragraph 2, 3 and 4 in the discussion), and certainly many factors indeed can affect the fitness.
This comment is related to comment 3, so please also see answer to comment 3.

It is assumed that if some taxa and some genetic elements prevail in a given environment (co-occurrence), they are likely associated. Although this is a logical principle, it is not necessarily true, and therefore should not be tacitly accepted unless evidence of such association is demonstrated. However, co-occurrence is the strongest argument provided.
Authors' reply: We do not make any assumption that a given taxon and specific mobile genetic elements are genetically coupled/associated. From the metagenomic data, we cannot determine whether the fragments we identify belong to the same cell or not. What we do is investigate whether they are present in the same physical environment (not necessarily in the same cell), and if they are, we assume a higher possibility of them becoming genetically associated. The opposite is apparent: if they are not found in the same physical environments, the chances of becoming genetically associated are certainly lower.
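
The presence criterion described in this reply can be illustrated with a minimal sketch. It is not the authors' pipeline: the sample values, the detection limit and the function name `co_occurrence_rate` are hypothetical and serve only to show what "present in the same physical environment" means when expressed per metagenome.

```python
from collections import defaultdict

# Hypothetical per-sample values: (environment, origin species abundance, MISE abundance).
# These numbers are illustrative only and are not data from the study.
SAMPLES = [
    ("wastewater", 1.2e-4, 3.0e-5),
    ("wastewater", 8.0e-5, 0.0),
    ("human_gut", 0.0, 2.0e-6),
    ("human_gut", 5.0e-6, 0.0),
    ("soil", 0.0, 0.0),
]


def co_occurrence_rate(samples, detection_limit=1e-6):
    """Return, per environment, the fraction of samples in which both the
    origin species and its mobilizing element exceed an assumed detection limit."""
    counts = defaultdict(lambda: [0, 0])  # environment -> [co-detected, total]
    for env, species_abund, mise_abund in samples:
        counts[env][1] += 1
        if species_abund > detection_limit and mise_abund > detection_limit:
            counts[env][0] += 1
    return {env: hits / total for env, (hits, total) in counts.items()}


rates = co_occurrence_rate(SAMPLES)
for env, rate in rates.items():
    print(f"{env}: {rate:.2f}")
```

Note that such a measure only says that both fragments were detected in the same sample; it says nothing about whether they sit in the same cell, which is the distinction the reply draws.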

It is assumed that the process of acquisition of novel antibiotic resistance genes involves the most abundant bacterial groups in a given environment. If there is any evidence that demonstrates this, it should be clearly provided in the manuscript. Stochastic events are probably part of the resistance acquisition process. Also, the spatial distribution of donor and recipient bacteria may be important in this process and it cannot be predicted based on abundance. Hence, it may be erroneous to generalize that acquisition of novel resistance genes will involve the most abundant species or a specific type of environment.

Authors' reply:
In contrast to what the reviewer says, we do not assume or state that the process of acquisition of novel ARGs involves the most abundant bacterial groups in a given environment. What we have done in this study is that we have compared the abundance of 22 species that already have been shown to be the recent origin species for certain ARGs, in various environments. We are not comparing the abundance of one species to that of another. We only compare one species at a time across environments to see where they are more or less abundant (or even absent).
What is correct is that we assume that a higher abundance of an origin species in one environment compared to another would lead to more chances for mobilization to have occurred through stochastic processes (given everything else is similar). It should however be pointed out that this was only one of the factors we considered. We also determined where the investigated origin species and the mobilizing genetic elements involved in mobilization of ARGs from the investigated species were present and abundant (i.e. above/below detection limit) - since both the species and corresponding mobilizing elements are needed in order for mobilization to have occurred, environments where this criterion is not fulfilled simply could not have been where mobilization occurred (given that the conditions in the environments stay the same). There is also new evidence indicating that several ARGs have been repeatedly mobilized from the chromosome of their origin species, which would speak in favour of an environment where these conditions are constantly (or at least regularly) fulfilled as the site of mobilization. We have added this argument in the last paragraph of the discussion (lines 438-442).
The reviewer is correct that the spatial distribution within a given environment can also play a role (probably more so in solid media such as soils, versus liquid media such as wastewaters where bacteria move more freely), but that does not make abundance unimportant. The same thing goes for selection pressure from antibiotics, which we also discuss in the manuscript (paragraph 2, 3 and to some extent 4).
We have given some thought to why the reviewer seems to have misinterpreted our study on several points, and based on comments given by reviewer 4 we have substantially extended and added information about our method and the background of our study (i.e. the origin species). See lines 39-49 in the introduction, a new 1st paragraph in Results as well as a new 1st paragraph in Methods, together with more clarifications throughout the text. We have also removed the (previously) 4th paragraph of the introduction where we describe a later part of the chain (i.e. horizontal gene transfer), as we suspect that this paragraph added confusion about the aim of the study.

Other comments:
It is difficult to understand how the authors identified the "origin species": which criteria or information sources were used? This is not clear for the reader.

Authors' reply:
We suspect that many comments given by both reviewer 1 and 2 are based on misinterpretations of our study and the concept of origin species. This might partly be because they have not thoroughly read the current manuscript and the study by Ebmeyer et al., which this study is based on, but also because we have been unclear. We have therefore extended the background information about the concept of mobilization and origin species with lines 29, 39-48 in the introduction, a new first section in results as well as a new first section in the methods.
For the reviewer: The origin species we investigated in this study have been established by us in only a few cases - they have been discovered by many different research groups but were summarized in a paper by us in 2021 (Ebmeyer et al., 2021). This should now be more clearly stated in many places in the manuscript.
L11-The question may be not where, but when and how antibiotic resistance is acquired for the first time.
Authors' reply: We do not agree with the reviewer. If we don't know where resistance is acquired for the first time, we risk focusing our mitigation efforts on the wrong environments. Furthermore, one does not exclude the other. While the information on how and when antibiotic resistance is acquired is also important, the focus of this study is "where".

L39 -"origin species" is misleading; indeed it seems the authors are referring to founder effects and founder species;
Authors' reply: As mentioned above, we believe that the reviewer has misunderstood the concept of this study. We are very aware of the concept of founder effects in ecology, but what we investigate has nothing to do with founder effects. As mentioned in the comment above, we have now clarified the concept of origin species in several places in the manuscript.

The hypothesis of the study is not totally clearly stated and aligned with experimental design;
Authors' reply: The aims of the study are clearly stated in the last paragraph of the introduction. We believe that if one understands the concept of origin species and MISE, the aims of the study will become clear. Therefore we have, as mentioned above, added more background and explanations of the concept and the study. We also removed a paragraph (previously paragraph 4) in the introduction which we believe caused confusion regarding the aim.

L108 -which species hold these MISE? Which is the average MISE number per bacterial genome?
Authors' reply: We thank the reviewer for the comment and have, based on the reviewer's suggestion, searched all genomes present in NCBI RefSeq for the MISE. We found only 223 unique species that carried a MISE (there were 7283 unique species present in RefSeq), and that proteobacterial species were overrepresented among the carriers of MISE. Furthermore, we found that in general, species carrying MISE were often more closely related to the origin species in which the MISE was associated with mobilization than to other origin species. We have extended the results section with lines 234-245 and supplementary figure 6 to describe these results.
How different is that average number in different environments? This information might be useful for a better understanding of the data and the results.

Authors' reply:
Since the data we analysed consisted of metagenomic fragments, we cannot give a reliable estimate of the number of MISE per carrier. Instead, we refer the reviewer to figure 3, which shows the average relative abundance of MISE in each environment (fig 3a) and the relative abundance of all IS-elements in each environment (fig 3b).
L115 and following - Since the methods are the last section of the paper, there should be a better description of procedures in the results section, to allow the reader to follow a rationale. For instance, it would be beneficial to have a brief explanation of the database and the optimization procedure, the criterion to select "origin species", etc.

Authors' reply:
We thank the reviewer for the comment and have, as mentioned above, extended the introduction and added new paragraphs to both results and methods. With regard to "selecting origin species", we included all known origin species with complete sequenced genomes (specified in Table 1) that fulfil the criteria with regard to sufficient evidence, as outlined in Ebmeyer et al. (2021).

L117 -It seems that this validation may have some bias, since with genomes the probability of unspecific identification is much lower than in a metagenome data set. Was this normalized taking such bias into account?
Authors' reply: The evaluation of the method was done using simulated fragmented data from known genomes. This is necessary since we need to have full knowledge about the content of the simulated data in order to estimate the error rates. We have now clarified this in several places in the text (lines 131-132, 134, 137). See also the results from the validation on complex metagenomic samples generated by CAMISIM (supplementary table 2).
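
To make concrete why simulated data with known content are needed, the sketch below shows how a true positive rate (TPR) and false positive rate (FPR) could be estimated when the true source of every simulated fragment is known. It is an illustration under stated assumptions, not the authors' validation code: the labels, the classifier output and the function name `error_rates` are hypothetical.

```python
def error_rates(true_labels, predicted_labels, target):
    """Estimate TPR and FPR for one target species from simulated fragments
    whose true source genome is known (hypothetical helper, for illustration)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == target and pred == target:
            tp += 1  # fragment from the target, correctly assigned
        elif truth != target and pred == target:
            fp += 1  # fragment from another genome, wrongly assigned to the target
        elif truth == target:
            fn += 1  # fragment from the target, missed
        else:
            tn += 1  # fragment from another genome, correctly not assigned
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr


# Made-up example: "A" is the target species, "B" stands for any other genome.
truth = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"]
predicted = ["A", "A", "A", "B", "B", "B", "B", "B", "A", "B"]
tpr, fpr = error_rates(truth, predicted, target="A")
print(f"TPR = {tpr:.2f}, FPR = {fpr:.3f}")
```

With real metagenomes the `truth` list is unknown, which is why neither rate can be computed without simulated or otherwise fully characterised input.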

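Most of the MISE presumably associated with antibiotic resistance genes and IS (figure 3) are indeed observed in raw influent. Although the number of metagenomes analysed is small,

Authors' reply: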
We believe that the reviewer is referring to the number of included studies, not the number of metagenomes. The number of metagenomes included in our analysis is 2496 (totalling more than 22 trillion bases). This is a very substantial number given the current standards of the field (although there are some studies that are even larger).
it seems that the relative abundance of these genetic elements is reduced during wastewater treatment. According to these data, and to support the hypothesis raised in the title of the manuscript, which would be the preferential niche for antibiotic resistance acquisition? The plumbing system?
Authors' reply: Yes, this is correct. We have in many places throughout the manuscript changed the word WWTP to wastewaters.
In general, there is an important bias for Proteobacteria in wastewater, so the conclusions are strongly influenced by this fact. It is assumed that because wastewater has a higher abundance of Proteobacteria than other environments, they are probably acquiring novel genes in that environment. Although this is a logical argument, it seems to lack strong evidence. When wastewater is compared with human stool, it might be more accurate to consider the ratio of "origin species" to members of the same phylum in that environment. Otherwise, all data are strongly biased by abundance.

Authors' reply:
What is interesting is the relative abundance of origin species (and MISE), not the taxonomic distribution of all the bystanders in the communities. The reasons for why the abundance of an origin species/MISE might be higher in one environment than another may overlap with reasons for why abundance of proteobacteria in general is higher. However, and this is very important, why the origin species/MISE are more abundant in one environment compared to another does not matter for the conclusion on the probability that they will be able to interact and become genetically associated. We simply hypothesize that presence and high abundance of origin species/MISE in a given environment is to some extent linked to higher risk for them to interact (given everything else alike).
We would again also like to clarify that we are not comparing abundances of all species (only the investigated origin species) and we do not, in contrast to what the reviewer says, assume that Proteobacteria are acquiring novel genes in wastewaters. We do, however, hypothesise that the origin species, which all are Proteobacteria, might have acquired MISE in wastewaters.
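
The difference between the two measures discussed in this exchange (the relative abundance of an origin species in the whole community versus the reviewer's proposed ratio to members of the same phylum) can be shown with a small sketch. The read counts and the function name `abundance_metrics` are hypothetical and chosen only so that the two measures diverge; they are not values from the study.

```python
def abundance_metrics(origin_reads, phylum_reads, total_reads):
    """Compute two alternative normalizations for one origin species in one sample."""
    return {
        # Share of all reads assigned to the origin species.
        "relative_abundance": origin_reads / total_reads,
        # Share of the phylum's reads assigned to the origin species.
        "within_phylum_ratio": origin_reads / phylum_reads,
    }


# Made-up counts: the phylum is 50x more abundant in the wastewater sample.
wastewater = abundance_metrics(origin_reads=500, phylum_reads=50_000, total_reads=100_000)
stool = abundance_metrics(origin_reads=10, phylum_reads=1_000, total_reads=100_000)

print(wastewater)
print(stool)
```

In this constructed case the within-phylum ratio is identical in both samples while the relative abundance differs 50-fold, which is the situation the reviewer is concerned about; the authors' argument is that the relative abundance is the quantity that matters for the chance of an origin species encountering a MISE.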

R2: Summary
The authors use a set of species to look for them in public metagenomic data from different environments. The metagenome analysis shows that wastewaters and human stools have a high abundance of the set of species. They analyze some ISs in the set of species and find that IS content is the highest in hospital effluent, poultry faeces and WWTP. Their results suggest that residual water could be a site for the mobilization of antibiotic-resistance genes. However, there are important issues that need to be addressed before this study can be considered for publication.

There is an important piece of data or analysis missing. The authors did not present the data where one can see that the ISs are carrying antibiotic resistance genes (ARGs). The authors conducted analyses showing the presence or absence of the ISs in the origin species but it's not shown if these ISs have ARGs. Equally important, the authors basically just analyze ISs and not a good diversity of Mobile Genetic Elements (i.e. plasmids, phages, pathogenicity islands, etc.). Considering this issue, the title and abstract are misleading. Given what is stated in the introduction, specifically lines 40-47, I assume they have already conducted the analysis linking ISs to ARGs in the "origin" species in a previous study, if so please say that explicitly.
Authors' reply: Unfortunately, the reviewer seems to have largely misinterpreted the rationale of this study. We investigate already identified origin species, identified as origins of ARGs in previous studies and outlined by Ebmeyer et al. (Ebmeyer et al., 2021). The mobilization of ARG(s) from these species has in many cases been associated (in previous studies by other researchers) with certain IS-elements. We are therefore not interested in whether the investigated IS-elements carry an ARG or not, but merely look for their presence, since they would likely have been required to be in the same physical environment as the origin species for the mobilization of the respective ARG to have happened. We are analysing only IS-elements because we are investigating the 22 origin species for which there is already published evidence that mobilization of specific ARGs likely has happened with the help of specific IS-elements.
Since we have noted that both reviewer 1 and reviewer 2 have misunderstood the concept of origin species and therefore also the concept of the study, we have removed one potentially confusing paragraph in the introduction as well as expanded the introduction, results and methods with new paragraphs and explanations about origin species and our aims with this study. See also answers to reviewer 1.

Why did the authors choose the species listed in Table 1 as their origin species? Please give the reason behind it. There are some ESKAPE pathogens but also some other bacterial species whose clinical relevance is not obvious. Again, I think the authors used the set of species considered in a previous study by them.

Authors' reply:
We have included all species with available complete genomes that have been shown, in previous studies largely by other authors, to be the recent evolutionary origin of different ARGs. The list of all previously confirmed origin species, and the evidence for these being the origin of specific ARGs, can be found in a published article by Ebmeyer et al. (Ebmeyer et al., 2021) where the current knowledge within the field was summarized. We have clarified this in several places in the manuscript. See also answers to comment 1 and reviewer 1.

For reproducibility's sake, the accession number (or BioSample number) should be provided for all the genomes downloaded from the NCBI assembly site - this can be added as a supplementary table.
Please, also mention in the text how many genomes were included for each origin species. This is important to know how much information was included for each origin species.

Authors' reply:
We have now added supplementary data containing the accession numbers for the downloaded genomes as well as the number of genomes included for each origin species, and we reference these data on lines 123, 224 and 487. Since we are analysing 22 origin species, we do not, for the readability of the text, want to write out the number of included genomes for each species in the text but instead refer to the supplementary data. Though we now have included the number of genomes for each origin species, we argue that the extra value gained from knowing exactly how many genomes of the origin species were included is limited. Instead, the results from the validation of the method contain detailed information regarding the number of genomes tested, both in and not in the database, together with the estimates of the method's performance.

The use of the word origin (of ARGs) is sometimes elusive. For instance in line 81. Do you mean the origin of a new allelic variant? Clearly, you do not mean the origin of the whole gene families. Please define more precisely what you mean by origin. Maybe you want to say the "most recent origin".
Authors' reply: The definition of gene families is different for different ARG classes, and the names of ARGs do not always, unfortunately, follow the definition for a gene family of the respective class (Bush & Jacoby, 2010; Jacoby et al., 2008). However, if one classifies a group of ARGs with the same name as the same gene family, we do in some instances mean whole gene families (see Ebmeyer et al., 2021). For instance, the beta-lactamase "gene family" FOX (including all mobile allelic variants in that group, i.e. FOX-1, FOX-2, FOX-3 and so on) has been recently mobilized from Aeromonas allosaccharophila, while the "gene family" SHV was mobilized from Klebsiella pneumoniae, etc. However, since the definition of gene family varies and is not always coherent with the names of the ARGs, we have abstained from introducing the concept in the manuscript since a) we are not investigating the ARGs themselves, and b) we believe it will only confuse the reader.
Instead we refer the reader to Ebmeyer et al. to read more about the different ARGs with a known origin species and we have added lines 39-48 where we explain the definition and concept of origin species. We have also changed the wording in line 79 to make it more coherent with our previous and later use of "origin species".

Minor comments
The discussion is rather long and, in some places, rather redundant.

Authors' reply:
We agree with the reviewer and have removed the whole second to last paragraph, which contained a lot of repetition of what had already been said.
Phages have also been shown to be agents mobilizing ARGs, see references below:
https://pubmed.ncbi.nlm.nih.gov/32109174/
https://pubmed.ncbi.nlm.nih.gov/34232073/

Authors' reply: This study focuses on the 22 origin species from which certain ARGs most likely have been recently mobilized with the help of IS-elements. Although phages can be involved in the horizontal transfer of ARGs, we have not yet seen conclusive evidence pointing to a single ARG which has been recently mobilized from a defined taxon with the help of phage(s). Of course, this does not exclude that phages might have mobilized ARGs (making a previously largely immobile chromosomal ARG more easily movable), but we are using a list of confirmed origin species. The first article referenced above describes predicted prophages in A. baumannii genomes, and concludes that many of the predicted prophages encode ARGs. However, the authors cannot determine whether these prophages contribute to ARG spread, and even less so whether the finding is related to mobilization of ARGs. The second article referenced aimed to investigate the distribution of ARGs and virulence factor genes within prophages in seven pathogens. Here the authors discover that some prophages can harbour many ARGs and that these ARGs were located near mobile genetic elements. Again, although it is possible that phages play a role in horizontal transfer and/or mobilization of ARGs, neither of the provided articles shows any conclusive evidence of mobilization.

Lines 454, 465 and 394: should be genus (the singular of genera).
Authors' reply: We thank the reviewer for pointing this out and have made changes accordingly.
Lines 455-457: why did you do that with the sequence annotated as plasmids?
Authors' reply: Since most plasmids can be easily exchanged between species we cannot accurately assign fragments belonging to a plasmid to a certain species. We have added lines 487-491 to motivate our decision to create a separate plasmid category.
Lines 462-465: Please provide more details about the simulated reads for reproducibility's sake.

Authors' reply:
We thank the reviewer for pointing this out and have added more details (line 498).
The coverage for the IS identification ("-query-cover 50") is a bit low. I would suggest at least 60% or even 70% to be sure that you are covering most of the gene and not just 50%.

Authors' reply: Query coverage refers to the coverage of the sequences which are compared to the database (which contains the subject sequences). In our case the query sequences are therefore the metagenomic fragments (and not, as the reviewer suggests, the genes). Since the shortest transposases within the IS-elements in our reference database are around 80 bp, and our fragments range from 75 to 150 bp, our query coverage was set so that the longest fragments (i.e. 150 bp) would have a chance to pass our criteria for the shortest transposases: at 50%, a 150 bp fragment needs 75 matching bp, which fits within an 80 bp transposase. If we set the query cut-off to 60%, we would require 90 bp of a 150 bp fragment to match, and this would be impossible for the shorter transposases.
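The arithmetic behind this cut-off can be checked with a short sketch. The fragment lengths (75 to 150 bp), the shortest transposase length (about 80 bp) and the coverage thresholds are taken from the reply above; the function names are hypothetical, and the check is a simplification that ignores gaps, identity cut-offs and any other alignment criteria.

```python
# Illustrative only: how many bases of a read must align to satisfy a
# query-coverage cut-off, and whether that can fit in a short target.

SHORTEST_TRANSPOSASE_BP = 80  # approximate value stated in the reply


def min_aligned_bases(read_len: int, query_cover_pct: int) -> int:
    """Smallest number of read bases that must align (ceiling of the percentage)."""
    return (read_len * query_cover_pct + 99) // 100


def can_pass(read_len: int, target_len: int, query_cover_pct: int) -> bool:
    """True if the required aligned stretch is no longer than the target sequence."""
    return min_aligned_bases(read_len, query_cover_pct) <= target_len


if __name__ == "__main__":
    for cover in (50, 60, 70):
        for read_len in (75, 100, 150):
            needed = min_aligned_bases(read_len, cover)
            ok = can_pass(read_len, SHORTEST_TRANSPOSASE_BP, cover)
            print(f"cover={cover}% read={read_len} bp needs {needed} bp, "
                  f"fits in {SHORTEST_TRANSPOSASE_BP} bp target: {ok}")
```

Under these assumptions, a 150 bp fragment can still match an 80 bp transposase at 50% query coverage, but not at 60% or 70%.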
Please provide references for ggplot2, phyloseq and vegan.
Authors' reply: We thank the reviewer for pointing this out and have added the references.
Reviewer #4 (Remarks to the Author):

The authors present a novel bioinformatic analysis approach that was developed to identify environments that are likely the origin (or hot spots) for the mobilization of known ARGs that were not mobile before. The authors used a customized database that is composed of origin species for specific ARGs that were non-mobile and a priori present on the chromosome of these organisms, and later mobilized by the association of insertion sequences. The authors developed an approach to identify and define the origin species in a previous study, which is essential to read to understand the concept and the analysis approach.
Overall, the authors present very interesting and novel findings, by identifying and highlighting wastewater environments as putative hotspots for past and future mobilization events for ARGs. This was achieved by detecting the abundance of the defined origin species and their association to mobilizing insertion sequences (MISE) in metagenomic datasets from wastewater compared to soil, water, human and animal samples. The approach is novel. The authors also correlated the presence of origin species and MISE with the human faecal marker crAssphage in their datasets. Surprisingly, a high correlation to origin species and MISE pairs was found for the presence of crAssphage in all samples but human stool.

Introduction
Line 50/51: reference 16: to my understanding this reference makes no claim supporting this statement. This article highlights that the development of methicillin resistance predates the clinical use of beta-lactams, not that the origin species evolved so far that it cannot be identified anymore. Maybe the authors could rephrase and explain what they meant by this statement, or whether their statement is an interpretation of the results / study they are referring to.

Authors' reply:
The reviewer is correct that the statement on lines 51-52 was an interpretation of the cited research. Indeed, to show that an ARG has been mobilized from a species that no longer exists would be very difficult, if not impossible. We have therefore removed the reference, but are keeping the conceptual statement, which still holds (line 57).

Lines 69 to 79 are based on 2 review papers and the authors' hypothesis only. They also do not consider the natural lifestyle of bacteria (biofilms), and the role of free and external DNA and naturally competent cells in complex environments (neither in the environment studied).

Authors' reply: We have completely removed the paragraph in which the referenced lines were located. We did this because in this paragraph we were describing a later step of the chain, i.e. HGT, and not when an ARG initially becomes mobile (associated with an IS-element), and we believe that this was one of the reasons that other reviewers misinterpreted the aims of our study. We have, however, modified the article in other places to also include free and external DNA as a potential source of MISE, see lines 101 and 244. In addition, we have previously taken the possibility of external DNA into consideration when discussing the initial association (see lines 308 and 390).
It is true that other factors such as biofilms could affect the potential for mobilization and/or transfer of an ARG. However, we have in this study chosen to focus on the environments on a larger scale, such as sewers, soil, waters etc., and have not considered the microenvironments within these larger categories. We have added a sentence in the discussion to clarify that biofilms are an aspect we have not considered in our analysis (lines 381-382).
The first paragraph of the results section is very difficult to understand and seems to fit better into the methods section.
The definition of origin species is not clear and requires the reader to fully study the previously published work by the same authors (reference 15). I would suggest to re-phrase this paragraph and to explain how the origin species are defined.

Authors' reply:
We thank the reviewer for this comment and have written a new first paragraph of the results section explaining on what criteria we chose the origin species included in this study. This paragraph also includes a non-technical description about the database creation. We have also rewritten the original first paragraph describing the results from the database evaluation. In addition, we have extended the second paragraph of the introduction with more information about the concept of origin species, including a clear definition.
The optimized database should be explained (which criteria) and referenced.
Authors' reply: In addition to the newly written first paragraph of the Results section (see answer above), the database creation and evaluation are explained in paragraphs 2, 3 and 4 of the Methods section. We have extended all these paragraphs with more details and explanations (see answer below). The results of the evaluation are available in supplementary tables 1, 2 and 3. In addition, we have added a statement in "data availability" saying that the database is available upon request.
The authors do not describe clearly how small fragments of 100 to 150 bp from metagenomic sequence data are enough to identify the origin species (what type of fragments? Filtered 16S rRNA gene fragments?).

Authors' reply: We thank the reviewer for pointing this out. The foundation of the method is the software Kraken2 with optimized parameters, together with a custom-made database optimized for the identification of the 22 origin species we investigate. Kraken2 operates directly on reads using a k-mer based approach, i.e. not just on 16S rRNA gene fragments (Wood et al., 2019, p. 2). To explain this, we have added a line in the newly written first paragraph of the results (line 120) as well as a sentence on lines 471-477 in Methods.

Or several reads mapping on different parts of the genome of the origin species? How is that specific enough to allow the identification of a species based on small reads? It could be a fragment that originates from another species (the core genome, for example) or a closely related genus. It would be great if the authors could explain how they identified / classified short metagenomic sequencing reads on species level. It seems like a truly innovative and valuable approach; however, it is very difficult to grasp (also when reading their other paper in which they curated the database of origin species) the way it is written.

Authors' reply:
The reviewer is correct in that it is notoriously difficult to classify short reads at species level. That is why we have created a custom database optimized to correctly classify reads originating from specifically the 22 investigated origin species, while not classifying reads from other species, especially from closely related genera, as belonging to any of our origin species (i.e., the database is not optimized to correctly identify other species). We extended the default Kraken2 database (containing bacterial, human, plasmid, known vector and viral genomes) with many manually confirmed genomes from genera closely related to each origin species, in order to avoid false classification of reads coming from these. In essence, that means that only discriminatory regions of the DNA are used for classification, while those that overlap between closely related species are not used for classification to the species level. In addition, we chose the "confidence" parameter of Kraken2 to be 0.3 (default is 0). Simplified, this means that if uncertainty is too high, the read will not be classified at species level. Therefore only a low proportion of reads is assigned at species level at all (see supplementary tables 2 and 3), as, for this study, we decided it is better to discard data than to falsely classify it as an origin species. It is therefore important to note that we cannot measure the absolute abundance of an origin species in a certain environment, but we can compare the abundances within species across samples.
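The effect of a confidence threshold on read-level classification can be illustrated with a toy sketch. This is not Kraken2's implementation: the toy taxonomy, the per-k-mer labels and the function name `classify_read` are hypothetical, ties are broken arbitrarily, and real Kraken2 works on minimizers and a full taxonomy. The sketch only shows the general idea that a read label is moved up the taxonomy, or dropped, when too small a fraction of its k-mers supports the lower rank.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "root": None,
    "Klebsiella": "root",
    "K. pneumoniae": "Klebsiella",
    "K. oxytoca": "Klebsiella",
}


def lineage(taxon, parent):
    """Return the path from a taxon up to the root, taxon first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = parent.get(taxon)
    return path


def classify_read(kmer_taxa, parent, confidence):
    """Assign one label to a read from its per-k-mer labels.

    kmer_taxa: one entry per k-mer of the read; None means the k-mer
    was not found in the database.
    Returns a taxon name, or None if the read stays unclassified.
    """
    total = len(kmer_taxa)
    hits = Counter(t for t in kmer_taxa if t is not None)
    if total == 0 or not hits:
        return None

    # Clade count: k-mers assigned to a taxon or to any of its descendants.
    clade = Counter()
    for taxon, n in hits.items():
        for ancestor in lineage(taxon, parent):
            clade[ancestor] += n

    # Initial call: the taxon whose root-to-taxon path collects most k-mers.
    best = max(hits, key=lambda t: sum(hits[a] for a in lineage(t, parent)))

    # Move up the taxonomy until the clade fraction reaches the threshold.
    node = best
    while node is not None:
        if clade[node] / total >= confidence:
            return node
        node = parent.get(node)
    return None


if __name__ == "__main__":
    read = ["K. pneumoniae"] * 2 + ["Klebsiella"] * 4 + [None] * 4
    print(classify_read(read, PARENT, confidence=0.0))  # species-level call
    print(classify_read(read, PARENT, confidence=0.3))  # pushed up to genus
```

In this toy example only 2 of 10 k-mers are species-specific, so a threshold of 0.3 moves the label from species to genus; a read with very few database hits is left unclassified.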
In addition, the method was rigorously tested. As the reviewer pointed out, one would suspect that reads coming from a closely related genus will have the highest risk of being falsely classified as an origin species, especially if this species was not included in the database (i.e. we have no knowledge about this genome). Therefore, one of our evaluation strategies was to test the method on fragmented genomes from closely related genera, where the genomes used for testing were not included in our database (see supplementary table 3). The method was also tested on fragmented genomes from our 22 origin species (genomes both included and not included in the database) as well as on complex metagenomic samples generated by the Critical Assessment of Metagenome Interpretation (CAMI) consortium (Sczyrba et al., 2017). This is now further explained in the first and second paragraphs of the Results, and we have extended paragraphs 2 and 3 of the Methods section. In addition, the performance of the method is discussed in paragraph 7 of the Discussion section.
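As a rough illustration of this kind of evaluation, the sketch below computes a true positive rate for fragments simulated from an origin species genome and a false positive rate for fragments simulated from a closely related non-origin genome. The definitions used here are assumptions made for illustration (the exact definitions in the manuscript may differ), and the function names and example labels are hypothetical.

```python
# Illustrative read-level evaluation of a classifier on simulated fragments.
# A label of None means the fragment was left unclassified at species level.


def true_positive_rate(assigned, true_species):
    """Fraction of fragments from an origin species genome that were
    assigned to that species (unclassified fragments count as misses)."""
    if not assigned:
        raise ValueError("no fragments to evaluate")
    return sum(1 for label in assigned if label == true_species) / len(assigned)


def false_positive_rate(assigned, origin_species):
    """Fraction of fragments from a non-origin genome that were
    assigned to any of the origin species."""
    if not assigned:
        raise ValueError("no fragments to evaluate")
    return sum(1 for label in assigned if label in origin_species) / len(assigned)


if __name__ == "__main__":
    origin = {"Klebsiella pneumoniae", "Aeromonas allosaccharophila"}

    # Hypothetical labels for fragments simulated from a K. pneumoniae genome.
    from_origin = ["Klebsiella pneumoniae", None, None, "Klebsiella pneumoniae"]
    # Hypothetical labels for fragments from a closely related, non-origin genome.
    from_relative = [None, "Klebsiella pneumoniae", None, None]

    print(true_positive_rate(from_origin, "Klebsiella pneumoniae"))
    print(false_positive_rate(from_relative, origin))
```

With a strict confidence threshold, many fragments remain unclassified, which lowers the true positive rate computed this way while keeping the false positive rate low; this is the trade-off described in the reply.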
I would suggest an extended method section reintroducing the method, the theory, the definition of origin species.
Authors' reply: We thank the reviewer for this comment and have added a new paragraph at the beginning of the Methods section where we again introduce the concept of origin species and mobilizing IS elements, as well as our rationale for choosing the 22 origin species investigated in this study. In addition, we have extended paragraphs 2 and 3 with more details and explanations of concepts regarding the creation and evaluation of the method.
Technical and statistical details as to validating their data by using TPR and FPR should be in supplementary data.

Authors' reply:
We have shortened the paragraph in Results which reports the TPR and FPR (second paragraph) by removing details. However, we believe that it is important to keep the summarised results of the evaluation, as they give confidence in the method on which many of the results in the article rely.